Load the data and preprocess it for analysis and modeling. This includes handling missing values, converting categorical variables into dummy/indicator variables, and encoding ordinal variables. 📊🔍🧹
Perform exploratory data analysis to gain insights into the dataset, understand the distributions of features, and explore potential relationships between the features and the disease outcomes. 📊🔬📉
Perform data cleansing and transformation to improve the model's performance. This includes imputing missing values and normalizing numeric features. 💡🔬
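As a rough illustration of the imputation and normalization mentioned in this step, the sketch below fills missing numeric values with the column median and rescales the column to zero mean and unit variance. The toy frame and the helper name `impute_and_scale` are hypothetical; the notebook itself may handle these steps differently.

```python
import numpy as np
import pandas as pd

def impute_and_scale(df, columns):
    """Median-impute and z-score the given numeric columns (returns a copy)."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].median())
        std = out[col].std(ddof=0)
        # Guard against constant columns to avoid division by zero.
        out[col] = (out[col] - out[col].mean()) / std if std > 0 else 0.0
    return out

# Hypothetical toy data, not taken from CVD_cleaned.csv.
toy = pd.DataFrame({"BMI": [22.0, np.nan, 30.0, 26.0]})
scaled = impute_and_scale(toy, ["BMI"])
```

Median imputation is only one option; it is used here because it is less sensitive to extreme values than the mean.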
This is the age of the patient. Age is a crucial factor in disease prognosis as the risk of chronic conditions such as heart disease, cancer, diabetes, and arthritis increases with age. This is due to various factors including the cumulative effect of exposure to risk factors, increased wear and tear on the body, and changes in the body's physiological functions. 🌡️👴
This feature represents the gender of the patient. Gender can influence disease prognosis due to biological differences and gender-specific lifestyle patterns. For instance, heart disease is somewhat more common in males in this dataset (a correlation of about 0.07 with Sex, where Male is coded as 1), while skin cancer shows almost no association with sex (about 0.01). Such differences could be due to factors like life expectancy or different exposure to risk factors in each gender. ♀️♂️
This is a self-rated health status of the patient. Patients who perceive their health as "Poor" or "Fair" are more likely to have chronic conditions. This could be because the symptoms or management of these conditions impact their perceived health status. 💓
This feature represents the frequency of health checkups. Regular health checkups can help in early detection and management of diseases, thereby improving the prognosis. 🏥
This feature indicates whether the patient exercises regularly or not. Regular exercise can help control weight, reduce risk of heart diseases, and manage blood sugar and insulin levels, among other benefits. This aligns with the negative correlation observed between exercise and diseases such as heart disease, diabetes, and arthritis. 🏃♂️🏋️♀️
This feature indicates whether the patient has a history of smoking. Smoking can increase disease risk as it can damage blood vessels, increase blood pressure, and reduce the amount of oxygen reaching the organs. 🚬🚭
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Dataset
cardio_df=pd.read_csv(r'C:\Hero Vired practice\Project\CVD_cleaned.csv',skipinitialspace=True)
cardio_df.head()
| | General_Health | Checkup | Exercise | Heart_Disease | Skin_Cancer | Other_Cancer | Depression | Diabetes | Arthritis | Sex | Age_Category | Height_(cm) | Weight_(kg) | BMI | Smoking_History | Alcohol_Consumption | Fruit_Consumption | Green_Vegetables_Consumption | FriedPotato_Consumption |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Poor | Within the past 2 years | No | No | No | No | No | No | Yes | Female | 70-74 | 150.0 | 32.66 | 14.54 | Yes | 0.0 | 30.0 | 16.0 | 12.0 |
| 1 | Very Good | Within the past year | No | Yes | No | No | No | Yes | No | Female | 70-74 | 165.0 | 77.11 | 28.29 | No | 0.0 | 30.0 | 0.0 | 4.0 |
| 2 | Very Good | Within the past year | Yes | No | No | No | No | Yes | No | Female | 60-64 | 163.0 | 88.45 | 33.47 | No | 4.0 | 12.0 | 3.0 | 16.0 |
| 3 | Poor | Within the past year | Yes | Yes | No | No | No | Yes | No | Male | 75-79 | 180.0 | 93.44 | 28.73 | No | 0.0 | 30.0 | 30.0 | 8.0 |
| 4 | Good | Within the past year | No | No | No | No | No | No | No | Male | 80+ | 191.0 | 88.45 | 24.37 | Yes | 0.0 | 8.0 | 4.0 | 0.0 |
We'll inspect each variable individually to understand its distribution and potential outliers. This will provide insights into the characteristics of each variable and help identify any extreme values or anomalies.
We'll explore the relationship between each variable and the target variables (Heart_Disease, Skin_Cancer, Other_Cancer, Diabetes). This analysis will allow us to understand how each variable is associated with the presence or absence of these diseases. We can use techniques like bar charts to visualize the distributions of the target variables based on different categories or levels of other variables.
We'll study the interactions between different variables and how they collectively relate to the target variables. This analysis will help us uncover complex relationships and patterns that may not be apparent in the univariate or bivariate analyses. Techniques such as scatter plots, correlation matrices, and 3D visualizations can be utilized to gain deeper insights into the data.
# Smoking_History is categorical (Yes/No), so it is plotted with the categorical features instead
Numerical_Features=['Height_(cm)','Weight_(kg)','BMI','Alcohol_Consumption','Fruit_Consumption','Green_Vegetables_Consumption','FriedPotato_Consumption']
for i in Numerical_Features:
    plt.figure(figsize=(10,4))
    sns.histplot(data=cardio_df,x=i)
    plt.title('Distribution of '+ i)
    plt.show()
Height follows an approximately normal distribution, with the majority of patients between 160 and 180 cm.
Weight also appears roughly normally distributed, with most patients weighing between approximately 60 and 100 kg.
BMI is slightly right-skewed. A large number of patients have a BMI between 20 and 30, which falls within the normal to overweight range. However, there are also a significant number of patients with a BMI in the obese range (>30).
Alcohol consumption is heavily right-skewed. Most patients have low alcohol consumption, but a few patients have high consumption.
Fruit consumption is also right-skewed. Many patients consume fruit regularly, but a significant number consume it less frequently.
Green vegetable consumption appears to be normally distributed, with most patients consuming green vegetables moderately.
Fried potato consumption is right-skewed. Many patients consume fried potatoes infrequently, while a few consume them more often.
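The visual impressions of skew described above could be backed by a number. A minimal sketch, assuming pandas' `DataFrame.skew()` is an acceptable measure; the toy columns below are hypothetical and stand in for `cardio_df[Numerical_Features]`.

```python
import pandas as pd

# Hypothetical toy sample, not taken from the dataset.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [0, 0, 1, 1, 30],
})

# Positive values indicate a longer right tail; values near 0 indicate a roughly symmetric shape.
skewness = toy.skew()
print(skewness)
```

Applied to the real columns, this would give one skewness value per feature to compare against the histograms.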
# Check the distribution of categorical features
categorical_features = ['General_Health', 'Checkup', 'Exercise', 'Heart_Disease', 'Skin_Cancer', 'Other_Cancer', 'Depression', 'Diabetes', 'Arthritis', 'Sex', 'Age_Category', 'Smoking_History']
for feature in categorical_features:
    plt.figure(figsize=(10, 4))
    sns.countplot(data=cardio_df, x=feature)
    plt.title('Count of ' + feature)
    plt.xticks(rotation=90)
    plt.show()
Most patients describe their general health as "Good", with "Very Good" being the second most common response. Fewer patients rate their health as "Fair" or "Poor".
The majority of patients had a checkup within the past year. Fewer patients had their last checkup 2 years ago or more than 5 years ago.
More patients reported that they exercise compared to those who do not.
A significant majority of patients do not have heart disease. Only a small proportion of patients have heart disease.
The vast majority of patients do not have skin cancer.
Similar to skin cancer, most patients do not have other forms of cancer.
Most patients do not suffer from depression. However, a non-trivial number of patients do report having depression.
Similar to the disease-related features above, most patients do not have diabetes. However, a small proportion do have diabetes.
Most patients do not have arthritis, but a significant number do.
There are slightly more female patients than male patients in the dataset.
The dataset includes patients from a wide range of age categories. The 65-69 age category has the most patients, followed by the 60-64 and 70-74 categories.
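Phrases such as "significant majority" and "small proportion" can be made concrete with class shares. The sketch below uses `value_counts(normalize=True)` on a hypothetical four-row column; the same call on a `cardio_df` column would be expected to give the actual proportions.

```python
import pandas as pd

# Hypothetical toy column, not the real Heart_Disease distribution.
toy = pd.DataFrame({"Heart_Disease": ["No", "No", "No", "Yes"]})

# Share of each class, as a fraction of all rows.
shares = toy["Heart_Disease"].value_counts(normalize=True)
print(shares)
```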
selected_variables = ['General_Health', 'Exercise', 'Sex', 'Age_Category', 'Smoking_History']
disease_conditions = ['Heart_Disease', 'Skin_Cancer', 'Other_Cancer', 'Diabetes', 'Arthritis']
for disease in disease_conditions:
    for variable in selected_variables:
        plt.figure(figsize=(10, 4))
        sns.countplot(data=cardio_df, x=variable, hue=disease)
        plt.title('Relationship between ' + variable + ' and ' + disease)
        plt.xticks(rotation=90)
        plt.show()
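Because the disease classes are imbalanced, raw counts in these plots can hide differences in rates between groups. One possible complement is a row-normalized cross-tabulation, sketched here on hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame({
    "Exercise": ["Yes", "Yes", "Yes", "Yes", "No", "No"],
    "Heart_Disease": ["No", "No", "No", "Yes", "No", "Yes"],
})

# Each row sums to 1, so the "Yes" column is the disease rate within that group.
rates = pd.crosstab(toy["Exercise"], toy["Heart_Disease"], normalize="index")
print(rates)
```

In this toy example the disease rate is 0.25 among those who exercise and 0.50 among those who do not.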
cardio_df.head()
| | General_Health | Checkup | Exercise | Heart_Disease | Skin_Cancer | Other_Cancer | Depression | Diabetes | Arthritis | Sex | Age_Category | Height_(cm) | Weight_(kg) | BMI | Smoking_History | Alcohol_Consumption | Fruit_Consumption | Green_Vegetables_Consumption | FriedPotato_Consumption |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Poor | Within the past 2 years | No | No | No | No | No | No | Yes | Female | 70-74 | 150.0 | 32.66 | 14.54 | Yes | 0.0 | 30.0 | 16.0 | 12.0 |
| 1 | Very Good | Within the past year | No | Yes | No | No | No | Yes | No | Female | 70-74 | 165.0 | 77.11 | 28.29 | No | 0.0 | 30.0 | 0.0 | 4.0 |
| 2 | Very Good | Within the past year | Yes | No | No | No | No | Yes | No | Female | 60-64 | 163.0 | 88.45 | 33.47 | No | 4.0 | 12.0 | 3.0 | 16.0 |
| 3 | Poor | Within the past year | Yes | Yes | No | No | No | Yes | No | Male | 75-79 | 180.0 | 93.44 | 28.73 | No | 0.0 | 30.0 | 30.0 | 8.0 |
| 4 | Good | Within the past year | No | No | No | No | No | No | No | Male | 80+ | 191.0 | 88.45 | 24.37 | Yes | 0.0 | 8.0 | 4.0 | 0.0 |
cardio_df['General_Health'].unique()
array(['Poor', 'Very Good', 'Good', 'Fair', 'Excellent'], dtype=object)
General_Health_mapping={'Poor':0,'Good':2,'Very Good':3,'Fair':1,'Excellent':4}
cardio_df['General_Health']=cardio_df['General_Health'].map(General_Health_mapping)
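# Note: these codes do not follow recency order ('5 or more years ago' is 2 but 'Within the past 5 years' is 3),
# so Checkup acts as a label encoding here rather than a true ordinal scale.
checkup_mapping={'Within the past 2 years': 0, 'Within the past year':1,
                 '5 or more years ago': 2, 'Within the past 5 years' : 3 , 'Never': 4}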
cardio_df['Checkup']=cardio_df['Checkup'].map(checkup_mapping)
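If Checkup is meant to be ordinal, the codes would need to increase with time since the last checkup. The sketch below shows one way to build such a mapping; the ordering in `checkup_order` and the name `ordered_checkup_mapping` are assumptions, not part of the original notebook.

```python
import pandas as pd

# Assumed ordering, from most recent checkup to least recent.
checkup_order = [
    "Within the past year",
    "Within the past 2 years",
    "Within the past 5 years",
    "5 or more years ago",
    "Never",
]
ordered_checkup_mapping = {label: rank for rank, label in enumerate(checkup_order)}

# Hypothetical toy values to show the encoding.
toy = pd.Series(["Never", "Within the past year", "Within the past 5 years"])
encoded = toy.map(ordered_checkup_mapping)
print(encoded.tolist())
```

Switching to a mapping like this would change the Checkup codes and the correlation values reported later, so those outputs would need to be regenerated.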
Gender_mapping={'Female':0,'Male':1}
cardio_df['Sex']=cardio_df['Sex'].map(Gender_mapping)
Age_category_mapping={'70-74':10, '60-64':8, '75-79':11, '80+':12, '65-69':9, '50-54':6, '45-49':5,
'18-24':0, '30-34':2, '55-59':7, '35-39':3, '40-44':4, '25-29':1}
cardio_df['Age_Category']=cardio_df['Age_Category'].map(Age_category_mapping)
# Binary Yes/No columns; any other label in these columns becomes NaN under .map()
listt=['Exercise','Heart_Disease','Skin_Cancer','Other_Cancer','Depression','Diabetes','Arthritis','Smoking_History']
for i in listt:
    cardio_df[i]=cardio_df[i].map({'Yes':1 ,'No': 0})
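`Series.map` with a dictionary returns NaN for any value that is not a key, which is a plausible source of the missing Diabetes values that appear later in the notebook. A small check before mapping can surface such labels; the helper `unmapped_values` and the toy column are hypothetical.

```python
import pandas as pd

def unmapped_values(series, mapping):
    """Return the distinct non-null values of `series` that `mapping` does not cover."""
    return sorted(set(series.dropna().unique()) - set(mapping))

yes_no = {"Yes": 1, "No": 0}

# Hypothetical column with a label outside the Yes/No pair.
toy = pd.Series(["Yes", "No", "Borderline", "No"])
missing_labels = unmapped_values(toy, yes_no)
encoded = toy.map(yes_no)
print(missing_labels, encoded.isna().sum())
```

Running the helper on each column in `listt` before the loop would show which labels are about to be turned into NaN.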
cardio_df.head()
| | General_Health | Checkup | Exercise | Heart_Disease | Skin_Cancer | Other_Cancer | Depression | Diabetes | Arthritis | Sex | Age_Category | Height_(cm) | Weight_(kg) | BMI | Smoking_History | Alcohol_Consumption | Fruit_Consumption | Green_Vegetables_Consumption | FriedPotato_Consumption |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 1 | 0 | 10 | 150.0 | 32.66 | 14.54 | 1 | 0.0 | 30.0 | 16.0 | 12.0 |
| 1 | 3 | 1 | 0 | 1 | 0 | 0 | 0 | 1.0 | 0 | 0 | 10 | 165.0 | 77.11 | 28.29 | 0 | 0.0 | 30.0 | 0.0 | 4.0 |
| 2 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 1.0 | 0 | 0 | 8 | 163.0 | 88.45 | 33.47 | 0 | 4.0 | 12.0 | 3.0 | 16.0 |
| 3 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1.0 | 0 | 1 | 11 | 180.0 | 93.44 | 28.73 | 0 | 0.0 | 30.0 | 30.0 | 8.0 |
| 4 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 1 | 12 | 191.0 | 88.45 | 24.37 | 1 | 0.0 | 8.0 | 4.0 | 0.0 |
cardio_df=cardio_df.drop_duplicates()
data=cardio_df.corr()
data
| | General_Health | Checkup | Exercise | Heart_Disease | Skin_Cancer | Other_Cancer | Depression | Diabetes | Arthritis | Sex | Age_Category | Height_(cm) | Weight_(kg) | BMI | Smoking_History | Alcohol_Consumption | Fruit_Consumption | Green_Vegetables_Consumption | FriedPotato_Consumption |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| General_Health | 1.000000 | 0.017505 | 0.276080 | -0.232484 | -0.047079 | -0.145614 | -0.207533 | -0.278653 | -0.265911 | 0.018939 | -0.167350 | 0.066930 | -0.184197 | -0.246444 | -0.167538 | 0.118333 | 0.102602 | 0.119738 | -0.031816 |
| Checkup | 0.017505 | 1.000000 | -0.000230 | -0.022603 | -0.031536 | -0.029821 | -0.017365 | -0.037276 | -0.057399 | 0.062261 | -0.087853 | 0.049326 | 0.006746 | -0.018371 | 0.020790 | 0.014921 | -0.028077 | -0.024201 | 0.029969 |
| Exercise | 0.276080 | -0.000230 | 1.000000 | -0.096321 | -0.003963 | -0.054363 | -0.084673 | -0.146645 | -0.124785 | 0.059355 | -0.122334 | 0.091622 | -0.090121 | -0.155732 | -0.093241 | 0.095028 | 0.136782 | 0.124983 | -0.036904 |
| Heart_Disease | -0.232484 | -0.022603 | -0.096321 | 1.000000 | 0.090835 | 0.092369 | 0.032494 | 0.185341 | 0.153891 | 0.072606 | 0.229027 | 0.015783 | 0.045854 | 0.042642 | 0.107757 | -0.036614 | -0.020045 | -0.024027 | -0.009249 |
| Skin_Cancer | -0.047079 | -0.031536 | -0.003963 | 0.090835 | 1.000000 | 0.150781 | -0.013041 | 0.039286 | 0.136146 | 0.009658 | 0.272075 | 0.006799 | -0.028986 | -0.037647 | 0.032793 | 0.042734 | 0.024143 | 0.012894 | -0.038945 |
| Other_Cancer | -0.145614 | -0.029821 | -0.054363 | 0.092369 | 0.150781 | 1.000000 | 0.015861 | 0.072281 | 0.129320 | -0.042061 | 0.234464 | -0.043476 | -0.021169 | 0.001015 | 0.053390 | -0.008704 | 0.007992 | -0.003215 | -0.033326 |
| Depression | -0.207533 | -0.017365 | -0.084673 | 0.032494 | -0.013041 | 0.015861 | 1.000000 | 0.047341 | 0.121562 | -0.141457 | -0.103195 | -0.091315 | 0.047904 | 0.109557 | 0.100215 | -0.028200 | -0.039938 | -0.051134 | 0.018108 |
| Diabetes | -0.278653 | -0.037276 | -0.146645 | 0.185341 | 0.039286 | 0.072281 | 0.047341 | 1.000000 | 0.144823 | 0.020362 | 0.216574 | -0.017735 | 0.174925 | 0.210490 | 0.059028 | -0.116204 | -0.022296 | -0.032131 | -0.003678 |
| Arthritis | -0.265911 | -0.057399 | -0.124785 | 0.153891 | 0.136146 | 0.129320 | 0.121562 | 0.144823 | 1.000000 | -0.100047 | 0.370996 | -0.097794 | 0.074068 | 0.137924 | 0.123128 | -0.024968 | -0.001983 | -0.018803 | -0.050994 |
| Sex | 0.018939 | 0.062261 | 0.059355 | 0.072606 | 0.009658 | -0.042061 | -0.141457 | 0.020362 | -0.100047 | 1.000000 | -0.060234 | 0.698129 | 0.353989 | 0.010978 | 0.073407 | 0.129311 | -0.092486 | -0.069169 | 0.130049 |
| Age_Category | -0.167350 | -0.087853 | -0.122334 | 0.229027 | 0.272075 | 0.234464 | -0.103195 | 0.216574 | 0.370996 | -0.060234 | 1.000000 | -0.120922 | -0.062308 | -0.007426 | 0.133155 | 0.012833 | 0.043661 | 0.036030 | -0.142761 |
| Height_(cm) | 0.066930 | 0.049326 | 0.091622 | 0.015783 | 0.006799 | -0.043476 | -0.091315 | -0.017735 | -0.097794 | 0.698129 | -0.120922 | 1.000000 | 0.472175 | -0.027413 | 0.051762 | 0.128850 | -0.045925 | -0.030153 | 0.108790 |
| Weight_(kg) | -0.184197 | 0.006746 | -0.090121 | 0.045854 | -0.028986 | -0.021169 | 0.047904 | 0.174925 | 0.074068 | 0.353989 | -0.062308 | 0.472175 | 1.000000 | 0.859702 | 0.047481 | -0.032427 | -0.090611 | -0.075895 | 0.096327 |
| BMI | -0.246444 | -0.018371 | -0.155732 | 0.042642 | -0.037647 | 0.001015 | 0.109557 | 0.210490 | 0.137924 | 0.010978 | -0.007426 | -0.027413 | 0.859702 | 1.000000 | 0.024794 | -0.108750 | -0.076603 | -0.070629 | 0.048343 |
| Smoking_History | -0.167538 | 0.020790 | -0.093241 | 0.107757 | 0.032793 | 0.053390 | 0.100215 | 0.059028 | 0.123128 | 0.073407 | 0.133155 | 0.051762 | 0.047481 | 0.024794 | 1.000000 | 0.100553 | -0.093626 | -0.034371 | 0.035824 |
| Alcohol_Consumption | 0.118333 | 0.014921 | 0.095028 | -0.036614 | 0.042734 | -0.008704 | -0.028200 | -0.116204 | -0.024968 | 0.129311 | 0.012833 | 0.128850 | -0.032427 | -0.108750 | 0.100553 | 1.000000 | -0.012542 | 0.060088 | 0.020503 |
| Fruit_Consumption | 0.102602 | -0.028077 | 0.136782 | -0.020045 | 0.024143 | 0.007992 | -0.039938 | -0.022296 | -0.001983 | -0.092486 | 0.043661 | -0.045925 | -0.090611 | -0.076603 | -0.093626 | -0.012542 | 1.000000 | 0.270426 | -0.060302 |
| Green_Vegetables_Consumption | 0.119738 | -0.024201 | 0.124983 | -0.024027 | 0.012894 | -0.003215 | -0.051134 | -0.032131 | -0.018803 | -0.069169 | 0.036030 | -0.030153 | -0.075895 | -0.070629 | -0.034371 | 0.060088 | 0.270426 | 1.000000 | 0.003209 |
| FriedPotato_Consumption | -0.031816 | 0.029969 | -0.036904 | -0.009249 | -0.038945 | -0.033326 | 0.018108 | -0.003678 | -0.050994 | 0.130049 | -0.142761 | 0.108790 | 0.096327 | 0.048343 | 0.035824 | 0.020503 | -0.060302 | 0.003209 | 1.000000 |
a=data['General_Health']
b=pd.DataFrame(a)
b
| | General_Health |
|---|---|
| General_Health | 1.000000 |
| Checkup | 0.017505 |
| Exercise | 0.276080 |
| Heart_Disease | -0.232484 |
| Skin_Cancer | -0.047079 |
| Other_Cancer | -0.145614 |
| Depression | -0.207533 |
| Diabetes | -0.278653 |
| Arthritis | -0.265911 |
| Sex | 0.018939 |
| Age_Category | -0.167350 |
| Height_(cm) | 0.066930 |
| Weight_(kg) | -0.184197 |
| BMI | -0.246444 |
| Smoking_History | -0.167538 |
| Alcohol_Consumption | 0.118333 |
| Fruit_Consumption | 0.102602 |
| Green_Vegetables_Consumption | 0.119738 |
| FriedPotato_Consumption | -0.031816 |
sns.heatmap(data,cmap="coolwarm")
#### disease_variables = ['Heart_Disease', 'Skin_Cancer', 'Other_Cancer', 'Diabetes']
a=data['Heart_Disease']
b=pd.DataFrame(a)
sns.heatmap(b,annot=True,cmap='coolwarm')
plt.title('Correlation with Heart Disease')
a=data['Skin_Cancer']
b=pd.DataFrame(a)
sns.heatmap(b,annot=True,cmap='coolwarm')
plt.title('Correlation with Skin Cancer')
a=data['Other_Cancer']
b=pd.DataFrame(a)
sns.heatmap(b,annot=True,cmap='coolwarm')
plt.title('Correlation with Other Cancer')
a=data['Diabetes']
b=pd.DataFrame(a)
sns.heatmap(b,annot=True,cmap='coolwarm')
plt.title('Correlation with Diabetes')
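Heart disease shows a weak-to-moderate positive correlation with Age_Category (0.23) and Diabetes (0.19), and a negative correlation with General_Health (-0.23, where higher codes mean better self-rated health) and Exercise (-0.10). Its correlation with Sex is small and positive (0.07, with Male coded as 1).
Skin cancer is positively correlated with Age_Category (0.27), Other_Cancer (0.15), and Arthritis (0.14). Its correlation with Sex is negligible (0.01).
Other cancer shows a positive correlation with Age_Category (0.23) and a negative correlation with General_Health (-0.15). Its correlation with Sex is slightly negative (-0.04), meaning it is marginally more common among females.
Diabetes shows a positive correlation with Age_Category (0.22) and BMI (0.21), and a negative correlation with General_Health (-0.28) and Exercise (-0.15).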
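Reading the strongest associations off a heatmap is error-prone, so ranking a correlation column by absolute value can help. This is a minimal sketch; the helper `top_correlations` is hypothetical, and the values are copied from the Heart_Disease column of the correlation table above.

```python
import pandas as pd

def top_correlations(corr, target, n=5):
    """Return the n features most correlated with `target`, ranked by absolute value."""
    s = corr[target].drop(target)
    order = s.abs().sort_values(ascending=False).index
    return s.loc[order].head(n)

# A few entries copied from the Heart_Disease column of the correlation table.
corr = pd.Series({
    "Heart_Disease": 1.0,
    "General_Health": -0.232484,
    "Age_Category": 0.229027,
    "Diabetes": 0.185341,
    "Exercise": -0.096321,
    "Sex": 0.072606,
}, name="Heart_Disease").to_frame()

top = top_correlations(corr, "Heart_Disease", n=3)
print(top)
```

In the notebook, `top_correlations(data, 'Heart_Disease')` would be the corresponding call on the full correlation matrix.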
cardio_df.isnull().sum()
| Column | Missing values |
|---|---|
| General_Health | 0 |
| Checkup | 0 |
| Exercise | 0 |
| Heart_Disease | 0 |
| Skin_Cancer | 0 |
| Other_Cancer | 0 |
| Depression | 0 |
| Diabetes | 9542 |
| Arthritis | 0 |
| Sex | 0 |
| Age_Category | 0 |
| Height_(cm) | 0 |
| Weight_(kg) | 0 |
| BMI | 0 |
| Smoking_History | 0 |
| Alcohol_Consumption | 0 |
| Fruit_Consumption | 0 |
| Green_Vegetables_Consumption | 0 |
| FriedPotato_Consumption | 0 |

dtype: int64
cardio_df.dropna(inplace=True)
cardio_df.isnull().sum()
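| Column | Missing values |
|---|---|
| General_Health | 0 |
| Checkup | 0 |
| Exercise | 0 |
| Heart_Disease | 0 |
| Skin_Cancer | 0 |
| Other_Cancer | 0 |
| Depression | 0 |
| Diabetes | 0 |
| Arthritis | 0 |
| Sex | 0 |
| Age_Category | 0 |
| Height_(cm) | 0 |
| Weight_(kg) | 0 |
| BMI | 0 |
| Smoking_History | 0 |
| Alcohol_Consumption | 0 |
| Fruit_Consumption | 0 |
| Green_Vegetables_Consumption | 0 |
| FriedPotato_Consumption | 0 |

dtype: int64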
# Outliers in dataset
Numeric_Values=['Height_(cm)', 'Weight_(kg)', 'BMI', 'Alcohol_Consumption',
'Fruit_Consumption', 'Green_Vegetables_Consumption',
'FriedPotato_Consumption']
for i in Numeric_Values:
    sns.boxplot(cardio_df[i])
    plt.title('Outliers in ' + i)
    plt.show()
The minimum height is 91 cm, and the maximum is 241 cm. These could be extreme cases, but they're worth investigating further. 📏
The maximum weight is 293.02 kg, which seems quite high. This could potentially be an outlier or extreme value. ⚖️
The maximum BMI is 99.33, which is very high, even for extreme cases of obesity. This might indicate data entry errors. 🍔
The maximum alcohol consumption is 30, which seems quite high. We need to understand the measurement units to interpret whether this is an outlier or not. 🍺
The maximum values for fruit, green vegetable, and fried potato consumption seem quite high, but this depends on the measurement units (for example, servings per week/month). 🍎🥦🍟
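One way to probe the suspected data entry errors is to recompute BMI from height and weight (BMI = weight in kg divided by height in metres squared) and flag rows where the recorded value disagrees. In the sketch below, the first two rows come from the `head()` output above, the third row is a made-up inconsistent record, and the 0.5 tolerance is an arbitrary assumption.

```python
import pandas as pd

def bmi_from_height_weight(height_cm, weight_kg):
    """Standard BMI formula: kg / m^2."""
    return weight_kg / (height_cm / 100.0) ** 2

# Rows 0 and 1 are from the head() output; row 2 is a made-up inconsistent record.
toy = pd.DataFrame({
    "Height_(cm)": [150.0, 165.0, 170.0],
    "Weight_(kg)": [32.66, 77.11, 70.0],
    "BMI": [14.54, 28.29, 99.33],
})

recomputed = bmi_from_height_weight(toy["Height_(cm)"], toy["Weight_(kg)"])
# Tolerance of 0.5 is an assumption to absorb rounding in the recorded values.
suspicious = toy[(toy["BMI"] - recomputed).abs() > 0.5]
print(suspicious)
```

Rows flagged this way are candidates for inspection rather than confirmed errors.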
def outliers_treatment(column):
    # IQR rule: values more than 1.5 * IQR beyond the quartiles are treated as outliers
    q1,q3=np.percentile(column,[25,75])
    IQR=q3-q1
    lower_limit= q1 - (1.5*IQR)
    upper_range = q3 + (1.5*IQR)
    return lower_limit,upper_range
treatment_list=['Height_(cm)', 'Weight_(kg)', 'BMI', 'Alcohol_Consumption',
                'Fruit_Consumption', 'Green_Vegetables_Consumption',
                'FriedPotato_Consumption']
for i in treatment_list:
    l,u=outliers_treatment(cardio_df[i])
    # Rows below the lower limit or above the upper limit are outliers and get dropped
    datta=cardio_df[(cardio_df[i]<l) | (cardio_df[i]>u)]
    dattta=datta.index
    cardio_df.drop(dattta,inplace=True)
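The 1.5 × IQR rule behind this outlier treatment can be checked by hand on a small array. The values below are hypothetical, and the quartiles assume NumPy's default linear interpolation.

```python
import numpy as np

# Hypothetical toy values with one obvious extreme point.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = np.percentile(values, [25, 75])  # 3.25 and 7.75
iqr = q3 - q1                              # 4.5
lower_fence = q1 - 1.5 * iqr               # -3.5
upper_fence = q3 + 1.5 * iqr               # 14.5

# Points outside the fences are flagged as outliers.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers)
```

Only the value 100 falls outside the fences here, so it is the only point that would be dropped.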
Numeric_Values=['Height_(cm)', 'Weight_(kg)', 'BMI', 'Alcohol_Consumption',
'Fruit_Consumption', 'Green_Vegetables_Consumption',
'FriedPotato_Consumption']
for i in Numeric_Values:
    sns.boxplot(cardio_df[i])
    plt.title('Outliers in ' + i)
    plt.show()